In [2]:
import numpy as np
import pandas as pd
from IPython.display import Image
What we saw above is a common setup. We have $\mathbf{X}$ and $\mathbf{y}$ data from the past and $\mathbf{X}$ data for the present for which we want to predict the future $\mathbf{y}$ values.
We can generalize this notion of past/present data into what are commonly called train and test data.
Write a function predict_test_values(model, x_train, y_train, x_test), where model is a scikit-learn model, x_train and y_train are the training data, and x_test is the test input data. It should fit the model on the training data and return its predictions for x_test.
Then write a function calc_train_and_test_error(model, x_train, y_train, x_test, y_test), which fits model on x_train and y_train, predicts y values for both x_train and x_test, and uses mean_squared_error to compute the error on both the train and test data.
In [3]:
def mean_squared_error(y_true, y_pred):
"""
calculate the mean_squared_error given a vector of true ys and a vector of predicted ys
"""
diff = y_true - y_pred
return np.dot(diff, diff) / len(diff)
def predict_test_values(model, X_train, y_train, X_test):
model.fit(X_train, y_train)
return model.predict(X_test)
def calc_train_and_test_error(model, X_train, y_train, X_test, y_test):
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
return mean_squared_error(y_train, y_pred_train), mean_squared_error(y_test, y_pred_test)
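A quick sanity check of mean_squared_error on made-up numbers (hypothetical values, just to confirm the function behaves as expected):
print mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0]))  # ((1-1)^2 + (2-4)^2) / 2 = 2.0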
In [4]:
Image(url='http://radimrehurek.com/data_science_python/plot_bias_variance_examples_2.png')
Out[4]:
The idea in regularization is that we're going to modify our loss function to penalize overly complex models. Simpler models generalize better.
One way to do this is to try to keep our regression coefficients small. Why would we want to do this? One intuitive explanation is that if we have big regression coefficients we'll get large changes in the predicted values from small changes in input value. That's bad. Intuitively, our predictions should vary smoothly with the data.
So a model with smaller coefficients makes smoother predictions. It is simpler, which means it will have a harder time overfitting.
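As a tiny numeric illustration of that sensitivity (the coefficients here are made up, purely for intuition):
# a large coefficient amplifies a small change in one input feature
beta_large = 100.0
beta_small = 1.0
dx = 0.01  # a small perturbation of a single feature
print beta_large * dx  # the prediction moves by 1.0
print beta_small * dx  # the prediction moves by only 0.01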
We can change our linear regression loss function to help us reduce overfitting. Ridge regression adds a penalty on the sum of the squared coefficients (the $L_2$ norm of $\beta$):
\begin{eqnarray*} Loss(\beta) &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_2^2\\ &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D \beta_d^2\\ \end{eqnarray*}
We won't get into the details, but a ridge regression model can be optimized in much the same way as an unregularized linear regression: either with some form of gradient descent or with a matrix-based closed-form solution.
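For reference, the matrix-based solution has a closed form much like ordinary least squares, with $\alpha$ added along the diagonal (here $I$ is the identity matrix; the exact scaling of $\alpha$ depends on how the loss is normalized):
\begin{eqnarray*} \hat{\beta} &=& (X^TX + \alpha I)^{-1}X^Ty \end{eqnarray*}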
In [5]:
# Ridge Regression in scikit-learn
from sklearn import linear_model
model_ridge = linear_model.Ridge(alpha=0.5)
# once it's been fit, you can look at the learned beta values of the model with: model_ridge.coef_
Partner up. On one computer:
Using the calc_train_and_test_error function from the previous exercise, calculate the training and test error of a LinearRegression model and of a Ridge model with alpha=1 on the dataset below.
Add up the absolute values of the coefficients of each model. Which is bigger?
Note: If you have a fit model called m, then you can access a vector holding its learned coefficients with m.coef_.
Note: Check out the functions np.sum() and np.abs().
Discuss with your partner what's happening here.
In [7]:
# load overfitting data
with np.load('data/overfitting_data.npz') as data:
x_train = data['x_train']
y_train = data['y_train']
x_test = data['x_test']
y_test = data['y_test']
model_lr = linear_model.LinearRegression()
model_ridge = linear_model.Ridge(alpha=1)
print "Linear Regression Training and Test Errors:"
print calc_train_and_test_error(model_lr, x_train, y_train, x_test, y_test)
print
print "Ridge Regression Training and Test Errors:"
print calc_train_and_test_error(model_ridge, x_train, y_train, x_test, y_test)
print
print "Sum of Linear Regression Coefficients:"
print np.sum(np.abs(model_lr.coef_))
print
print "Sum of Ridge Regression Coefficients:"
print np.sum(np.abs(model_ridge.coef_))
print
In [8]:
?linear_model.Ridge
LASSO is another regularization method. It penalizes not with the square of the regression coefficients (the $\beta$s) but with their absolute values.
LASSO has the additional property that it tends to push the beta values of unimportant dimensions all the way to exactly 0. This enforces sparsity in our model: if having lots of small coefficients gives a simpler model, having lots of 0-valued coefficients gives an even simpler one.
\begin{eqnarray*} Loss(\beta) &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_1\\ &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D |\beta_d|\\ \end{eqnarray*}
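To make the penalty concrete, here's a sketch of how you could compute this loss by hand for a fit model (lasso_loss is our own hypothetical helper, not part of scikit-learn, and we gloss over the fitted intercept's role in the formula):
def lasso_loss(model, X, y, alpha):
    # mean squared error term
    residuals = y - model.predict(X)
    mse = np.dot(residuals, residuals) / len(y)
    # L1 penalty term: alpha times the sum of the absolute coefficient values
    return mse + alpha * np.sum(np.abs(model.coef_))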
In [9]:
# LASSO in scikit-learn
from sklearn import linear_model
model_lasso = linear_model.Lasso(alpha=0.5)
Using calc_train_and_test_error again, calculate the training and testing error for a LASSO model with alpha=1 on the dataset from the previous exercise. Compare these errors to those of the LinearRegression and Ridge models.
Add up the absolute values of the LASSO model's coefficients and compare the sum to those of the earlier models.
Look at the first several coefficients of the LinearRegression, Ridge, and LASSO models. What do you notice?
In [10]:
# Write your code here
model_lasso = linear_model.Lasso(alpha=1)
print "Ridge Regression Training and Test Errors:"
print calc_train_and_test_error(model_lasso, x_train, y_train, x_test, y_test)
print
print "Sum of Ridge Regression Coefficients:"
print np.sum(np.abs(model_lasso.coef_))
print
In [11]:
n_disp_coefs = 10
print 'Linear Regression Coefficients:'
print model_lr.coef_[:n_disp_coefs]
print
print 'Ridge Regression Coefficients:'
print model_ridge.coef_[:n_disp_coefs]
print
print 'LASSO Coefficients:'
print model_lasso.coef_[:n_disp_coefs]
print
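Elastic net is a third regularization method that mixes the ridge and LASSO penalties, weighted by a mixing parameter (scikit-learn's l1_ratio, written $\rho$ below). Conceptually, the loss looks like this (scikit-learn's exact parameterization differs by constant factors):
\begin{eqnarray*} Loss(\beta) &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \rho ||\beta||_1 + \alpha (1 - \rho) ||\beta||_2^2\\ \end{eqnarray*}
With l1_ratio=1 this behaves like LASSO; with l1_ratio=0 it behaves like ridge.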
In [12]:
from sklearn import linear_model
model_en = linear_model.ElasticNet(alpha=0.5, l1_ratio=0.1)
# note: scikit learn's current implementation of ElasticNet isn't stable with l1_ratio <= 0.01
Using calc_train_and_test_error again, calculate the training and testing error for an ElasticNet model with alpha=1 and l1_ratio=0.5 on the dataset from the previous exercises. Compare these errors to those of the LinearRegression, Ridge, and LASSO models.
Add up the absolute values of the coefficients of the ElasticNet model. Compare it to the sums from the LinearRegression, Ridge, and LASSO models.
Look at the first several coefficients of the ElasticNet model.
In [13]:
# Write your code here
model_en = linear_model.ElasticNet(alpha=1, l1_ratio=0.5)
print 'ElasticNet Errors:'
print calc_train_and_test_error(model_en, x_train, y_train, x_test, y_test)
print
print 'Sum of Absolute Values of ElasticNet Coefficients:'
print np.sum(np.abs(model_en.coef_))
print
n_disp_coefs = 10
print 'ElasticNet Coefficients:'
print model_en.coef_[:n_disp_coefs]
print
Now we know three types of regularization for linear regression: ridge regression, LASSO, and elastic net.
All of our regularized models had better test error than plain linear regression. But how should we choose which model to ultimately use, or which parameters to use? The answer is through careful use of cross validation.
There are many forms of cross validation, but the basic idea of each is to train your model on some of the data and estimate its future performance on other data.
Aside: So far we've been calculating error on our test dataset. This is conceptually almost identical to using a validation set, but with two significant differences: a validation set is carved out of the training data, so the model is fit on less data than it would be otherwise; and validation error may be used repeatedly to compare models and parameters, while the test set should stay untouched until a final evaluation.
In [14]:
# a helper function for performing validation set cross validation
from sklearn.cross_validation import train_test_split
validation_portion = 0.1
seed = 1234
x_train_small, x_valid, y_train_small, y_valid = \
train_test_split(x_train, y_train, test_size=validation_portion, random_state=seed)
print 'Original Training Set Size:'
print x_train.shape, y_train.shape
print
print 'Reduced Training Set Size:'
print x_train_small.shape, y_train_small.shape
print
print 'Validation Set Size:'
print x_valid.shape, y_valid.shape
print
Write a function validation_set_error(model, x_train, y_train, validation_portion=0.1, seed=1234) which returns the validation set estimate of the future error for the given model. This function should: split the training data into a reduced training set and a validation set, fit the model on the reduced training set, and return the mean squared error of the model's predictions on the validation set.
Use your calc_train_and_test_error(model, x_train, y_train, x_test, y_test) function to calculate training and test set errors for these models, and use your validation_set_error function to estimate the future error on the overfitting data for: a LinearRegression model, and a Ridge model with alpha=10.
In [19]:
def validation_set_error(model, x_train, y_train, validation_portion=0.1, seed=1234):
# FILL IN YOUR CODE HERE
x_train_small, x_valid, y_train_small, y_valid = \
train_test_split(x_train, y_train, test_size=validation_portion, random_state=seed)
model.fit(x_train_small, y_train_small)
y_pred_valid = model.predict(x_valid)
return mean_squared_error(y_valid, y_pred_valid)
# set up models
model_lr_valid = linear_model.LinearRegression()
model_ridge_valid = linear_model.Ridge(alpha=10)
# calculate errors
valid_portion = .1
n_seeds = 5
print "Linear Regression Training and Test Errors:"
# FILL IN YOUR CODE HERE
print calc_train_and_test_error(model_lr_valid, x_train_small, y_train_small, x_test, y_test)
print
print "Linear Regression Validation Errors:"
# FILL IN YOUR CODE HERE
print validation_set_error(model_lr_valid, x_train, y_train, validation_portion=0.1, seed=1234)
print
for seed in range(n_seeds):
print validation_set_error(model_lr_valid, x_train, y_train, validation_portion=valid_portion, seed=seed)
print
print "Ridge Regression Training and Test Errors:"
# FILL IN YOUR CODE HERE
print calc_train_and_test_error(model_ridge_valid, x_train_small, y_train_small, x_test, y_test)
print
print "Ridge Regression Validation Errors:"
# FILL IN YOUR CODE HERE
print validation_set_error(model_ridge_valid, x_train, y_train, validation_portion=0.1, seed=1234)
print
for seed in range(n_seeds):
print validation_set_error(model_ridge_valid, x_train, y_train, validation_portion=valid_portion, seed=seed)
print
K-Fold cross validation is another cross validation method for estimating the out-of-sample error of a model. It works like this: split the training data into k equally sized folds; for each fold in turn, fit the model on the other k-1 folds and measure its error on the held-out fold; then average the k error estimates.
In [20]:
Image(url='https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png')
Out[20]:
In [21]:
# scikit learn provides a useful object to help you perform kfold cross validation
from sklearn.cross_validation import KFold
n_data = len(y_train)
fold_count = 0
for train_reduced_row_ids, valid_row_ids in KFold(n_data, n_folds=4):
print
print
print "FOLD %d:" % fold_count
print "-------"
print("train_ids:\n%s\n\nvalid_ids\n%s" % (train_reduced_row_ids, valid_row_ids))
x_train_reduced = x_train[train_reduced_row_ids]
y_train_reduced = y_train[train_reduced_row_ids]
x_valid = x_train[valid_row_ids]
y_valid = y_train[valid_row_ids]
fold_count += 1
In [22]:
# NOTE: KFold isn't random at all; it assigns contiguous blocks of rows to folds.
# It's important to shuffle your data first before using it.
from sklearn.utils import shuffle
x_train_shuffled, y_train_shuffled = shuffle(x_train, y_train)
Write a function kfold_error(model, x_train, y_train, k=4, seed=1234) which returns the k-fold cross validation estimate of the future error for the given model. This function should: shuffle the training data, split it into k folds, fit the model and compute the validation error for each fold, and return the average of the k validation errors.
Use your calc_train_and_test_error(model, x_train, y_train, x_test, y_test) function to calculate training and test set errors for these models, and use your kfold_error function with k=5 to estimate the future error on the overfitting data for: a LinearRegression model, and a Ridge model with alpha=10.
In [30]:
def kfold_error(model, x_train, y_train, k=4, seed=1234):
# FILL IN YOUR CODE HERE
# shuffle training data
x_train_shuffled, y_train_shuffled = shuffle(x_train, y_train, random_state=seed)
n_data = len(y_train)
error_sum = 0
for train_reduced_row_ids, valid_row_ids in KFold(n_data, n_folds=k):
x_train_reduced = x_train_shuffled[train_reduced_row_ids]
y_train_reduced = y_train_shuffled[train_reduced_row_ids]
x_valid = x_train_shuffled[valid_row_ids]
y_valid = y_train_shuffled[valid_row_ids]
model.fit(x_train_reduced, y_train_reduced)
y_valid_pred = model.predict(x_valid)
error_sum += mean_squared_error(y_valid, y_valid_pred)
return error_sum*1.0 / k
# set up models
model_lr_valid = linear_model.LinearRegression()
model_ridge_valid = linear_model.Ridge(alpha=10)
# calculate errors
n_seeds = 3
k = 5
print "Linear Regression Training and Test Errors:"
# FILL IN YOUR CODE HERE
print calc_train_and_test_error(model_lr_valid, x_train, y_train, x_test, y_test)
print
print "Linear Regression K-Fold Errors:"
# FILL IN YOUR CODE HERE
print
for seed in range(n_seeds):
print kfold_error(model_lr_valid, x_train, y_train, k=k, seed=seed)
print
print
print "Ridge Regression Training and Test Errors:"
# FILL IN YOUR CODE HERE
print calc_train_and_test_error(model_ridge_valid, x_train, y_train, x_test, y_test)
print
print "Ridge Regression K-Fold Errors:"
# FILL IN YOUR CODE HERE
print
for seed in range(n_seeds):
print kfold_error(model_ridge_valid, x_train, y_train, k=k, seed=seed)
print
In [32]:
# quick demo: prepending a NaN placeholder to a list
# (used below to leave the error columns blank for the "True Coefs" row)
[np.nan] + [1,2]
Out[32]:
In [33]:
def model_name(model):
s = str(model).lower()
if "linearregression" in s:
return 'LinearRegression'
elif "lasso" in s:
return 'Lasso(a=%g)' % model.alpha
elif "ridge" in s:
return 'Ridge(a=%g)' % model.alpha
elif "elastic" in s:
return 'ElasticNet(a=%g, r=%g)' % (model.alpha, model.l1_ratio)
else:
raise ValueError("Unknown Model Type")
def create_models(alphas=(.01, .03, .1, .3, 1, 3), l1_ratios=(.7, .5, .3)):
models = [linear_model.LinearRegression()]
models.extend([linear_model.Ridge(a) for a in alphas])
models.extend([linear_model.Lasso(a) for a in alphas])
models.extend([linear_model.ElasticNet(a, l1_ratio=l) for a in alphas for l in l1_ratios])
return models
def results_df(models, betas_true, x_train, y_train, x_test, y_test, k=4):
n_data, n_dim = x_train.shape
n_zeros = n_dim - len(betas_true)
betas_true = np.concatenate([betas_true, np.zeros(n_zeros)])
# fit models to training data
[m.fit(x_train, y_train) for m in models]
betas = np.vstack([betas_true] + [m.coef_ for m in models])
beta_names = ['Beta ' + str(i) for i in range(n_dim)]
# set up model names
model_names = ["True Coefs"] + [model_name(m) for m in models]
df = pd.DataFrame(data=betas, columns=beta_names, index=model_names)
# calculate training errors
y_preds = [m.predict(x_train) for m in models]
errors = [np.nan] + [mean_squared_error(y_train, y_pred) for y_pred in y_preds]
df['Train Error'] = errors
# calculate validation errors
errors = [np.nan] + [kfold_error(m, x_train, y_train, k=k) for m in models]
df['Cross Validation Error'] = errors
# calculate test errors
y_preds = [m.predict(x_test) for m in models]
errors = [np.nan] + [mean_squared_error(y_test, y_pred) for y_pred in y_preds]
df['Test Error'] = errors
return df
# these are some of the magic parameters that I used to actually
# generate the overfitting dataset
n_dim = 598
n_dim_meaningful = 3
n_dim_disp_extra = 2
# the actual betas used to generate the y values. the rest were 0.
betas_true = np.arange(n_dim_meaningful) + 1
# create a whole bunch of untrained models
models = create_models(alphas=(.01, .03, .1, .3, 1), l1_ratios=(.9, .7, .5))
# build the results table for every model
all_results = results_df(models, betas_true, x_train, y_train, x_test, y_test, k=4)
# decide which columns we want to display
disp_cols = ["Beta " + str(i) for i in range(n_dim_meaningful + n_dim_disp_extra)]
disp_cols += ['Train Error', 'Cross Validation Error', 'Test Error']
# display the results
all_results[disp_cols]
Out[33]:
In [35]:
%matplotlib inline
import matplotlib.pyplot as plt
f = plt.figure()
# if cross validation error is a good estimate of future performance,
# these points should fall near the diagonal
plt.scatter(all_results['Cross Validation Error'], all_results['Test Error'])
plt.xlabel('Cross Validation Error')
plt.ylabel('Test Error')
f.set_size_inches(8, 8)
plt.show()
In [36]:
# scikit learn includes some functions for making cross validation easier
# and computationally faster for some models
from sklearn import linear_model
model_ridge_cv = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
model_lasso_cv = linear_model.LassoCV(alphas=[0.1, 1.0, 10.0])
model_en_cv = linear_model.ElasticNetCV(l1_ratio=[.9], n_alphas=100)
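For example (assuming the overfitting data from earlier is still loaded), these CV variants choose their own regularization strength while fitting, and you can inspect the winning value afterwards:
model_ridge_cv.fit(x_train, y_train)
print model_ridge_cv.alpha_  # the alpha that performed best in cross validation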